Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of commented-out-code (~50-80%) #7706

Merged
merged 2 commits into from
Sep 29, 2023

Conversation

charliermarsh
Copy link
Member

@charliermarsh charliermarsh commented Sep 29, 2023

Summary

This PR implements a variety of optimizations to improve performance of the Eradicate rule, which always shows up in all-rules benchmarks and bothers me. (These improvements are not hugely important, but it was kind of a fun Friday thing to spent a bit of time on.)

The improvements include:

  • Doing cheaper work first (checking for some explicit substrings upfront).
  • Using aho-corasick to speed an exact substring search.
  • Merging multiple regular expressions using a RegexSet.
  • Removing some unnecessary \s* and other pieces from the regular expressions (since we already trim strings before matching on them).

Test Plan

I benchmarked this function in a standalone crate using a variety of cases. Criterion reports that this version is up to 80% faster, and almost every case is at least 50% faster:

Eradicate/Detection/# Warn if we are installing over top of an existing installation. This can
                        time:   [101.84 ns 102.32 ns 102.82 ns]
                        change: [-77.166% -77.062% -76.943%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Eradicate/Detection/#from foo import eradicate
                        time:   [74.872 ns 75.096 ns 75.314 ns]
                        change: [-84.180% -84.131% -84.079%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Eradicate/Detection/# encoding: utf8
                        time:   [46.522 ns 46.862 ns 47.237 ns]
                        change: [-29.408% -28.918% -28.471%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
Eradicate/Detection/# Issue #999
                        time:   [16.942 ns 16.994 ns 17.058 ns]
                        change: [-57.243% -57.064% -56.815%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Eradicate/Detection/# type: ignore
                        time:   [43.074 ns 43.163 ns 43.262 ns]
                        change: [-17.614% -17.390% -17.152%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Eradicate/Detection/# user_content_type, _ = TimelineEvent.objects.using(db_alias).get_or_create(
                        time:   [209.40 ns 209.81 ns 210.23 ns]
                        change: [-32.806% -32.630% -32.470%] (p = 0.00 < 0.05)
                        Performance has improved.
Eradicate/Detection/# this is = to that :(
                        time:   [72.659 ns 73.068 ns 73.473 ns]
                        change: [-68.884% -68.775% -68.655%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
Eradicate/Detection/#except Exception:
                        time:   [92.063 ns 92.366 ns 92.691 ns]
                        change: [-64.204% -64.052% -63.909%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
Eradicate/Detection/#print(1)
                        time:   [68.359 ns 68.537 ns 68.725 ns]
                        change: [-72.424% -72.356% -72.278%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
Eradicate/Detection/#'key': 1 + 1,
                        time:   [79.604 ns 79.865 ns 80.135 ns]
                        change: [-69.787% -69.667% -69.549%] (p = 0.00 < 0.05)
                        Performance has improved.

@codspeed-hq
Copy link

codspeed-hq bot commented Sep 29, 2023

CodSpeed Performance Report

Merging #7706 will improve performances by 3.56%

Comparing charlie/eradicate (91aede8) with main (e9f8b91)

Summary

⚡ 1 improvements
✅ 24 untouched benchmarks

Benchmarks breakdown

Benchmark main charlie/eradicate Change
linter/default-rules[large/dataset.py] 95.7 ms 92.4 ms +3.56%

@charliermarsh
Copy link
Member Author

Think I need to run the benchmarks locally with just this rule or something.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 29, 2023

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

@@ -62,16 +64,11 @@ pub(crate) fn comment_contains_code(line: &str, task_tags: &[String]) -> bool {
return false;
}

// Check that this is possibly code.
if CODE_INDICATORS.iter().all(|symbol| !line.contains(symbol)) {
Copy link
Member

@MichaReiser MichaReiser Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it helps performance but some of the above regex could be combined with https://docs.rs/regex/latest/regex/struct.RegexSet.html . But we may also be able to join the Regex by simply ORing them together (A | B | C) instead of having multiple regex expressions

@@ -150,7 +125,6 @@ mod tests {
fn comment_contains_code_with_print() {
assert!(comment_contains_code("#print", &[]));
assert!(comment_contains_code("#print(1)", &[]));
assert!(comment_contains_code("#print 1", &[]));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Python 2 code.

@@ -127,7 +103,6 @@ mod tests {
assert!(!comment_contains_code("# 123", &[]));
assert!(!comment_contains_code("# 123.1", &[]));
assert!(!comment_contains_code("# 1, 2, 3", &[]));
assert!(!comment_contains_code("x = 1 # x = 1", &[]));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method assumes you pass a comment, so this test doesn't really make sense.


if PARTIAL_DICTIONARY_REGEX.is_match(&line) {
// If the comment matches any of the specified positive cases, assume it's code.
if POSITIVE_CASES.is_match(line) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, combining HASH_NUMBER and ALLOWLIST_REGEX into a single regex ended up hurting performance, so I left those as two separate passes.

@charliermarsh charliermarsh added the performance Potential performance improvement label Sep 29, 2023
Comment on lines 73 to 74
wsl = { version = "0.1.0" }
aho-corasick = "1.1.1"
Copy link
Member

@konstin konstin Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which of these styles do we want (my preference is <pkg> = "x.y.z")?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I prefer the version one but I broke my rule.

@charliermarsh charliermarsh enabled auto-merge (squash) September 29, 2023 20:06
@charliermarsh charliermarsh changed the title Improve performance of commented-out-code rule Improve performance of commented-out-code rule by 50-80% Sep 29, 2023
@charliermarsh charliermarsh changed the title Improve performance of commented-out-code rule by 50-80% Improve performance of commented-out-code (50-80% improvement) Sep 29, 2023
@charliermarsh charliermarsh changed the title Improve performance of commented-out-code (50-80% improvement) Improve performance of commented-out-code (~50-80%) Sep 29, 2023
@charliermarsh charliermarsh enabled auto-merge (squash) September 29, 2023 20:06
@charliermarsh charliermarsh merged commit 8c8988e into main Sep 29, 2023
15 checks passed
@charliermarsh charliermarsh deleted the charlie/eradicate branch September 29, 2023 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Potential performance improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants